Job Scheduling for the BlueGene/L System



Abstract. BlueGene/L is a massively parallel cellular architecture sys- 
tem with a toroidal interconnect. Cellular architectures with a toroidal 
interconnect are effective at producing highly scalable computing sys- 
tems, but typically require job partitions to be both rectangular and 
contiguous. These restrictions introduce fragmentation issues that af- 
fect the utilization of the system and the wait time and slowdown of 
queued jobs. We propose to solve these problems for the BlueGene/L sys- 
tem through scheduling algorithms that augment a baseline first come 
first serve (FCFS) scheduler. Restricting ourselves to space-sharing tech- 
niques, which constitute a simpler solution to the requirements of cellular 
computing, we present simulation results for migration and backfilling 
techniques on BlueGene/L. These techniques are explored individually 
and jointly to determine their impact on the system. Our results demon- 
strate that migration can be effective for a pure FCFS scheduler but 
that backfilling produces even more benefits. We also show that migra- 
tion can be combined with backfilling to produce more opportunities to 
better utilize a parallel machine. 

1 Introduction 

BlueGene/L (BG/L) is a massively parallel cellular architecture system. 65,536 
self-contained computing nodes, or cells, are interconnected in a three-dimen- 
sional toroidal pattern [19]. In that pattern, each cell is directly connected to its 
six nearest neighbors, two each along the x, y, and z axes. Three-dimensional 
toroidal interconnects are simple, modular, and scalable, particularly when com- 
pared with systems that have a separate, typically multistage, interconnection 
network [13]. Examples of successful toroidal-interconnected parallel systems in- 
clude the Cray T3D and T3E machines [11]. 

There is, however, a price to pay with toroidal interconnects. We cannot 
view the system as a simple fully-connected interconnection network of nodes 
that are equidistant to each other (i.e., a flat network). In particular, we lose an 
important feature of systems like the IBM RS/6000 SP, which lets us pick any 
set of nodes for execution of a parallel job, irrespective of their physical location 
in the machine [1]. In a toroidal-interconnected system, the spatial allocation
of nodes to jobs is of critical importance. In most toroidal systems, including
BG/L, job partitions must be both rectangular (in a multidimensional sense) 
and contiguous. It has been shown by Feitelson and Jette [7] that, because of 
these restrictions, significant machine fragmentation occurs in a toroidal system. 
Fragmentation results in low system utilization and high wait time for queued 
jobs. 

In this paper, we analyze a set of strictly space-sharing scheduling techniques
to improve system utilization and reduce the wait time of jobs for the
BG/L system. Time-sharing techniques such as gang-scheduling are not explored 
since these types of schedulers require more memory and operating system in- 
volvement than are practically available in a cellular computing environment. We 
analyze the two techniques of backfilling [8, 14, 17] and migration [3, 20] in the
context of a toroidal-interconnected system. Backfilling is a technique that moves 
lower priority jobs ahead of other higher priority jobs, as long as execution of 
the higher priority jobs is not delayed. Migration moves jobs around the toroidal 
machine, performing on-the-fly defragmentation to create larger contiguous free 
space for waiting jobs. 

We conduct a simulation-based study of the impact of our scheduling algorithms
on the system performance of BG/L. Using actual job logs of supercomputing
centers, we measure the impact of migration and backfilling as enhancements
to a first-come first-serve (FCFS) job scheduling policy. Migration is
shown to be effective in improving maximum system utilization while enforcing 
a strict FCFS policy. We also find that backfilling, which bypasses the FCFS 
order, can lead to even higher utilization and lower wait times. Finally, we show 
that there is a small benefit from combining backfilling and migration. 

The rest of this paper is organized as follows. Section 2 discusses the schedul- 
ing algorithms used to improve job scheduling on a toroidal-interconnected par- 
allel system. Section 3 describes the simulation procedure to evaluate these algo- 
rithms and presents our simulation results. Section 4 describes related work and 
suggests future work opportunities. Finally, Section 5 presents the conclusions. 

2 Scheduling Algorithms 

System utilization and average job wait time in a parallel system can be improved 
through better job scheduling algorithms [■!, 5, 7, 0, 10, 12, 14, 15, 16, 17, 21,
22, !>(>]. The opportunity for improvement over a simple first-come first-serve
(FCFS) scheduler is much greater for toroidal interconnected systems because 
of the fragmentation issues discussed in Section 1. This section describes
four job scheduling algorithms that we evaluate in the context of BG/L. In all
algorithms, arriving jobs are first placed in a queue of waiting jobs, prioritized 
according to the order of arrival. The scheduler is invoked for every job arrival 
and job termination event in order to schedule new jobs for execution. 

Scheduler 1: First Come First Serve (FCFS). For FCFS, we adopt the heuristic 
of traversing the waiting queue in order and scheduling each job in a way that 
maximizes the largest free rectangular partition remaining in the torus. For each
job of size p, we try all the possible rectangular shapes of size p that fit in the
torus. For each shape, we try all the legal allocations in the torus that do not 
conflict with running jobs. Finally, we select the shape and allocation that results 
in the maximal largest free rectangular partition remaining after allocation of 
this job. We stop when we find the first job in the queue that cannot be scheduled. 

A valid rectangular partition does not always exist for a job. There are job 
sizes which are always impossible for the torus, such as prime numbers greater 
than the largest dimension size. Because job sizes are known at job arrival time, 
before execution, jobs with impossible sizes are modified to request the next 
largest possible size. Additionally, there are legal job sizes that cannot be sched- 
uled because of the current state of the torus. Therefore, if a particular job of 
size p cannot be scheduled, but some free partition of size q > p exists, the job 
will be increased in size by the minimum amount required to schedule it. For 
example, consider a 4 x 4 (two-dimensional) torus with a single free partition of 
size 2 x 2. If a user submits a job requesting 3 nodes, that job cannot be run. 
The scheduler increases the job size by one, to 4, and successfully schedules the 
job. 
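The "always impossible" sizes can be illustrated with a short sketch. It assumes the 4 x 4 x 8 supernode torus described in Section 3.1; the function names `shapes_of_size` and `next_feasible_size` are ours and purely illustrative. It covers only the geometric check, not the state-dependent size increase that depends on which partitions are currently free.

```python
from itertools import product

def shapes_of_size(p, dims):
    """Return all (a, b, c) with a * b * c == p that fit within the torus dims."""
    dx, dy, dz = dims
    return [
        (a, b, c)
        for a, b, c in product(range(1, dx + 1), range(1, dy + 1), range(1, dz + 1))
        if a * b * c == p
    ]

def next_feasible_size(p, dims):
    """Smallest size q >= p that has at least one rectangular shape in the torus."""
    total = dims[0] * dims[1] * dims[2]
    for q in range(p, total + 1):
        if shapes_of_size(q, dims):
            return q
    raise ValueError("job is larger than the torus")

dims = (4, 4, 8)                      # assumed supernode torus
print(next_feasible_size(11, dims))   # 11 is prime and larger than 8, so it becomes 12
print(next_feasible_size(3, dims))    # 3 fits as 3 x 1 x 1, so it stays 3
```

A request for 11 nodes is rounded up to 12 (for example 3 x 4 x 1), while a request for 3 nodes is geometrically valid and is only enlarged if the current state of the torus requires it.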

Determining the size of the largest rectangular partition in a given three-dimensional
torus is the most time-intensive operation required to implement
the maximal partition heuristic. When considering a torus of shape M x M x M,
a straightforward exhaustive search of all possible partitions takes $O(M^9)$
time. We have developed a more efficient algorithm that computes incremental
projections of planes and uses dynamic programming techniques. This projection
algorithm has complexity $O(M^5)$ and is described in Appendix A.
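For reference, the exhaustive search can be sketched as follows. This is a minimal illustration under our own assumptions (busy nodes held in a Python set, hypothetical function names), not the simulator's code: it tries every base location and every shape, and checks every node of the candidate partition with wraparound.

```python
from itertools import product

def is_free(busy, dims, base, shape):
    """True if every node of the partition (with wraparound) is free."""
    for off in product(*(range(s) for s in shape)):
        node = tuple((b + o) % d for b, o, d in zip(base, off, dims))
        if node in busy:
            return False
    return True

def largest_free_partition(busy, dims):
    """Exhaustive search over all base locations and all shapes."""
    best = 0
    for base in product(*(range(d) for d in dims)):
        for shape in product(*(range(1, d + 1) for d in dims)):
            size = shape[0] * shape[1] * shape[2]
            # Only test shapes that would improve on the best size found so far.
            if size > best and is_free(busy, dims, base, shape):
                best = size
    return best

dims = (4, 4, 4)
busy = {(0, 0, 0)}                          # a single allocated node
print(largest_free_partition(busy, dims))   # 48, e.g. a 3 x 4 x 4 slab avoiding x = 0
```

The three nested levels (base locations, shapes, nodes per shape) are what make this approach expensive on larger tori.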

An FCFS scheduler that searches the torus in a predictable incremental fash- 
ion, implements the maximal partition heuristic, and modifies job sizes when 
necessary is the simplest algorithm considered, against which more sophisticated 
algorithms are compared. 

Scheduler 2: FCFS With Backfilling. Backfilling is a space-sharing optimization 
technique. With backfilling, we can bypass the priority order imposed by the job 
queuing policy. This allows a lower priority job j to be scheduled before a higher 
priority job i as long as this reschedule does not delay the estimated start time 
of job i. 

The effect of backfilling on a particular schedule for a one-dimensional ma- 
chine can be visualized in Figure 1. Suppose we have to schedule five jobs, 
numbered from 1 to 5 in order of arrival. Figure 1(a) shows the schedule that 
would be produced by a FCFS policy without backfilling. Note the empty space 
between times $T_1$ and $T_2$, while job 3 waits for job 2 to finish. Figure 1(b) shows
the schedule that would be produced by a FCFS policy with backfilling. The 
empty space was filled with job 5, which can be executed before job 3 without 
delaying it. 

The backfilling algorithm seeks to increase system utilization without job 
starvation. It requires an estimation of job execution time, which is usually 
not very accurate. However, previous work [8, 18, 23] has shown that overestimating
execution time does not significantly affect backfilling results. Backfilling
has been shown to increase system utilization in a fair manner on an IBM
RS/6000 SP [8, 23].

Fig. 1. FCFS policy without (a) and with (b) backfilling. Job numbers correspond
to their position in the priority queue

Backfilling is used in conjunction with the FCFS scheduler and is only invoked 
when there are jobs in the waiting queue and FCFS halts because a job does not 
fit in the torus. A reservation time for the highest-priority job is then calculated, 
based on the worst case execution time of jobs currently running in the torus. 
The reservation guarantees that the job will be scheduled no later than that 
time, and if jobs end earlier than expected the reservation time may improve. 
Then, if there are additional jobs in the waiting queue, a job is scheduled out 
of order so long as it does not prevent the first job in the queue from being 
scheduled at the reservation time. Jobs behind the first job, however, may be 
delayed. 
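The reservation logic can be sketched for a flat machine, where only node counts matter. This is a simplification we introduce for illustration: on the torus, each check below would also have to confirm that a contiguous rectangular partition exists. All names are hypothetical.

```python
def backfill(total_nodes, running, queue, now):
    """running: list of (nodes, estimated_end); queue: list of
    (name, nodes, estimate) in priority order.
    Returns (reservation_time, names of jobs started out of order)."""
    free = total_nodes - sum(n for n, _ in running)
    _, head_nodes, _ = queue[0]
    # Reservation: earliest time at which enough nodes are free for the head
    # job, assuming running jobs use their full (worst-case) estimates.
    avail, reservation = free, now
    for nodes, end in sorted(running, key=lambda r: r[1]):
        if avail >= head_nodes:
            break
        avail += nodes
        reservation = end
    if avail < head_nodes:
        raise ValueError("head job does not fit in the machine")
    extra = avail - head_nodes  # nodes the head job leaves idle at the reservation
    started = []
    for name, nodes, estimate in queue[1:]:
        if nodes > free:
            continue
        ends_in_time = now + estimate <= reservation
        # A job may jump ahead if it finishes before the reservation, or if it
        # only uses nodes the head job will not need.
        if ends_in_time or nodes <= extra:
            started.append(name)
            free -= nodes
            if not ends_in_time:
                extra -= nodes
    return reservation, started

running = [(4, 10), (2, 20)]                            # 6 of 8 nodes busy
queue = [("J3", 6, 50), ("J4", 2, 5), ("J5", 2, 100)]
print(backfill(8, running, queue, now=0))               # (10, ['J4'])
```

In this example J4 is backfilled because it ends before the reservation at time 10, whereas J5 would still be running at that time and would delay J3.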

Just as the FCFS scheduler dynamically increases the size of jobs that cannot 
be scheduled with their current size, similar situations may arise during backfill- 
ing. Unlike FCFS, however, the size increase is performed more conservatively 
during backfilling because there are other jobs in the queue which might better 
utilize the free nodes of the torus. Therefore, a parameter I specifies the maximum
size by which the scheduler will increase a job. For example, by setting
I = 1 (our default value), backfilling increases a job size by at most one node.
This parameter is used only during the backfilling phase of scheduling; the FCFS 
phase will always increase the first job in the queue as much as is required to 
schedule it. 

Scheduler 3: FCFS With Migration. The migration algorithm rearranges the 
running jobs in the torus in order to increase the size of the maximal contiguous 
rectangular free partition. Migration in a toroidal-interconnected system com- 
pacts the running jobs and counteracts the effects of fragmentation. 

While migration does not require any more information than FCFS, it may
require additional hardware and software functionality. This paper does not attempt
to quantify the overhead of that functionality. However, accepting that
this overhead exists, migration is only undertaken when the expected benefits
are deemed substantial. The decision to migrate is therefore based on two parameters:
$FN_{tor}$, the ratio of free nodes in the system to the size of
the torus, and $FN_{max}$, the fraction of free nodes contained in the maximal free
partition. In order for migration to establish a significantly larger maximal free
partition, $FN_{tor}$ must be sufficiently high and $FN_{max}$ must be sufficiently low.
Section 3.4 contains further analysis of these parameters. 
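As a minimal sketch, the migration test reduces to two ratio comparisons. The default thresholds below are the standard policy values reported in Section 3.4 (free-node ratio above 0.1 and maximal-partition fraction below 0.7); the function and argument names are our own.

```python
def should_migrate(free_nodes, torus_size, max_free_partition,
                   min_fn_tor=0.1, max_fn_max=0.7):
    """Decide whether a migration attempt is worthwhile.

    fn_tor: fraction of the torus that is free.
    fn_max: fraction of the free nodes already inside the largest free partition.
    """
    if free_nodes == 0:
        return False  # nothing to defragment
    fn_tor = free_nodes / torus_size
    fn_max = max_free_partition / free_nodes
    return fn_tor > min_fn_tor and fn_max < max_fn_max

print(should_migrate(40, 128, 16))  # True: plenty of free nodes, badly fragmented
print(should_migrate(40, 128, 32))  # False: 80% of the free nodes are already contiguous
print(should_migrate(8, 128, 2))    # False: too few free nodes to be worth the overhead
```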

The migration process is undertaken immediately after the FCFS phase fails 
to schedule a job in the waiting queue. Jobs already running in the torus are 
organized in a queue of migrating jobs sorted by size, from largest to smallest. 
Each job is then reassigned a new partition, using the same algorithm as FCFS 
and starting with an empty torus. After migration, FCFS is performed again in 
an attempt to start more jobs in the rearranged torus. 

In order to ensure that all jobs fit in the torus after migration, job sizes 
are not increased if a reassignment requires a larger size to fit in the torus. 
Instead, the job is removed from the queue of migrating jobs, remaining in 
its original partition, and reassignment begins again for all remaining jobs in 
the queue. If the maximal free partition size after migration is worse than the 
original assignment, which is possible but generally infrequent under the current 
scheduling heuristics, migration is not performed. 

Scheduler 4: FCFS With Backfilling and Migration. Backfilling and migration
are independent scheduling concepts, and an FCFS scheduler may implement
both of these functions simultaneously. First, we schedule as many jobs as possible
via FCFS. Next, we rearrange the torus through migration to minimize
fragmentation, and then repeat FCFS. Finally, the backfilling algorithm from
Scheduler 2 is performed to make a reservation for the highest-priority job and
attempt to schedule jobs with lower priority so long as they do not conflict with
the reservation. The combination of these policies should lead to an even more
efficient utilization of the torus. For simplicity, we call this scheduling technique,
which combines backfilling and migration, B+M.

3 Experiments 

We use a simulation-based approach to perform quantitative measurements of
the efficiency of the proposed scheduling algorithms. An event-driven simulator 
was developed to process actual job logs of supercomputing centers. The results 
of simulations for all four schedulers were then studied to determine the impact 
of their respective algorithms. We begin this section with a short overview of the 
BG/L system. We then describe our simulation environment. We proceed with 
a discussion of the workload characteristics for the two job logs we consider. 
Finally, we present the experimental results from the simulations. 

3.1 The BlueGene/L System 

The BG/L system is organized as a 32 x 32 x 64 three-dimensional torus of nodes 
(cells). Each node contains processors, memory, and links for interconnecting to 
its six neighbors. The unit of allocation for job execution in BG/L is a 512-node
ensemble organized in an 8 x 8 x 8 configuration. This allocation unit is
the smallest granularity for which the torus can be electrically partitioned into 
a toroidal topology. Therefore, BG/L behaves as a 4 x 4 x 8 torus of these 
supernodes. We use this supernode abstraction when performing job scheduling 
for BG/L. 

3.2 The Simulation Environment 

The simulation environment models a torus of 128 (super)nodes in a three-dimensional
4 x 4 x 8 configuration. The event-driven simulator receives as input
a job log and the type of scheduler (FCFS, Backfill, Migration, or B+M) to
simulate. There are four primary events in the simulator: (1) an arrival event
occurs when a job is first submitted for execution and placed in the scheduler's
waiting queue; (2) a schedule event occurs when a job is allocated onto the torus;
(3) a start event occurs after a standard delay of one second following a schedule
event, at which time a job begins to run; and (4) a finish event occurs upon
completion of a job, at which point the job is deallocated from the torus. The
scheduler is invoked at the conclusion of every event that affects the states of 
the torus or the waiting queue (i.e., the arrival and finish events). 

A job log contains information on the arrival time, execution time, and size of
all jobs. Given a torus of size $N$, and for each job $j$ the arrival time $t_j^a$, execution
time $t_j^e$, and size $s_j$, the simulation produces values for the start time $t_j^s$ and
finish time $t_j^f$ of each job. These results are analyzed to determine the following
parameters for each job: (1) wait time $t_j^w = t_j^s - t_j^a$, (2) response time $t_j^r = t_j^f - t_j^a$,
and (3) bounded slowdown $t_j^{bs} = \max(t_j^r, \Gamma) / \max(t_j^e, \Gamma)$ for $\Gamma = 10$ seconds. The $\Gamma$ term
appears according to recommendations in [8], because some jobs have very short
execution times, which may distort the slowdown.
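The per-job metrics can be written out as a small helper. This sketch assumes the usual form of bounded slowdown, max(response, threshold) / max(execution, threshold), with the 10-second threshold given in the text; the function name is illustrative.

```python
def job_metrics(arrival, start, finish, execution, gamma=10.0):
    """Return (wait time, response time, bounded slowdown) for one job.

    All times are in seconds; gamma is the bounding threshold (10 s in the text).
    """
    wait = start - arrival
    response = finish - arrival
    bounded_slowdown = max(response, gamma) / max(execution, gamma)
    return wait, response, bounded_slowdown

# A 5-second job that waited 30 seconds: the unbounded slowdown would be
# 35 / 5 = 7, but the bound treats the job as if it ran for 10 seconds.
print(job_metrics(arrival=0, start=30, finish=35, execution=5))  # (30, 35, 3.5)
```

The threshold keeps very short jobs from dominating the average slowdown.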

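Global system statistics are also determined. Let the simulation time span
be $T = \max_j(t_j^f) - \min_j(t_j^a)$, the interval from the earliest job arrival to the
latest job finish. We then define system utilization (also called
capacity utilized) as

$$\omega_{util} = \sum_j \frac{s_j \, t_j^e}{T N} \qquad (1)$$

where $s_j$ and $t_j^e$ are the size and execution time of job $j$ and $N$ is the size of the torus.

Similarly, let f(t) denote the number of free nodes in the torus at time t and
g(t) denote the total number of nodes requested by jobs in the waiting queue
at time t. Then, the total amount of unused capacity in the system, $\omega_{unused}$, is
defined as:

$$\omega_{unused} = \int_{\min_j(t_j^a)}^{\max_j(t_j^f)} \frac{\max(0, f(t) - g(t))}{T N} \, dt \qquad (2)$$
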

This parameter is a measure of the work unused by the system because there 
is a lack of jobs requesting free nodes. The max term is included because the 
amount of unused work cannot be less than zero. The balance of the system capacity
is lost despite the presence of jobs that could have used it. The measure of
lost capacity in the system, which includes capacity lost because of the inability
to schedule jobs and the delay before a scheduled job begins, is then derived as:

$$\omega_{lost} = 1 - \omega_{util} - \omega_{unused} \qquad (3)$$

where $\omega_{util}$ is the capacity utilized and $\omega_{unused}$ is the unused capacity.

Table 1. Statistics for 10,000-job NASA and SDSC logs

                             NASA Ames iPSC/860 log    SDSC IBM RS/6000 SP log
Number of nodes:             128                       128
Job size restrictions:       powers of 2               none
Job size (nodes)
  Mean:                      6.3                       9.7
  Standard deviation:        14.4                      14.8
Workload (node-seconds)
  Mean:                      0.881 x 10^6              7.1 x 10^6
  Standard deviation:        5.41 x 10^6               25.5 x 10^6
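The three capacity measures (utilized, unused, and lost) can be illustrated with a small sketch. It assumes that the number of free nodes and the number of queued-job nodes are piecewise constant, so the integral for unused capacity becomes a sum over intervals, and that lost capacity is the remainder once utilized and unused capacity are subtracted from the total. The data layout and names are our own.

```python
def capacity_breakdown(jobs, intervals, torus_size):
    """jobs: list of (size, execution_time).
    intervals: list of (duration, free_nodes, queued_nodes) covering the whole
    simulation span, with free/queued counts constant inside each interval.
    Returns (utilized, unused, lost) as fractions of total capacity."""
    span = sum(d for d, _, _ in intervals)
    total = span * torus_size
    utilized = sum(s * e for s, e in jobs) / total
    # Free nodes count as unused only when no queued job is asking for them.
    unused = sum(d * max(0, f - q) for d, f, q in intervals) / total
    lost = 1.0 - utilized - unused
    return utilized, unused, lost

jobs = [(2, 10), (2, 5)]                      # 4-node machine, 10-second span
print(capacity_breakdown(jobs, [(5, 0, 0), (5, 2, 0)], 4))  # (0.75, 0.25, 0.0)
print(capacity_breakdown(jobs, [(5, 0, 0), (5, 2, 4)], 4))  # (0.75, 0.0, 0.25)
```

In the second call a 4-node job is waiting while 2 nodes sit idle, so the idle capacity is counted as lost rather than unused.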



3.3 Workload Characteristics 

We performed experiments on a 10,000-job span of two job logs obtained from 
the Parallel Workloads Archive [6]. The first log is from NASA Ames's 128- 
node iPSC/860 machine (from the year 1993). The second log is from the San 
Diego Supercomputer Center's (SDSC) 128-node IBM RS/6000 SP (from the 
years 1998-2000). For our purposes, we will treat each node in those two systems 
as representing one supernode (512-node unit) of BG/L. This is equivalent to 
scaling all job sizes in the log by 512, which is the ratio of the number of nodes 
in BG/L to the number of nodes in these 128-node machines. Table 1 presents 
the workload statistics and Figure 2 summarizes the distribution of job sizes and 
the contribution of each job size to the total workload of the system. Using these 
two logs as a basis, we generate logs of varying workloads by multiplying the 
execution time of each job by a coefficient c, mostly varying c from 0.7 to 1.4 in 
increments of 0.05. Simulations are performed for all scheduler types on each of 
the logs. With these modified logs, we plot wait time and bounded slowdown as 
a function of system utilization. 

3.4 Simulation Results 

Figures 3 and 4 present plots of average job wait time ($t_j^w$) and average job
bounded slowdown ($t_j^{bs}$), respectively, vs system utilization ($\omega_{util}$) for each of
the four schedulers considered and each of the two job logs. We observe that the
overall shapes of the curves for wait time and bounded slowdown are similar. 

The most significant performance improvement is attained through backfill- 
ing, for both the NASA and SDSC logs. Also, for both logs, there is a certain 
benefit from migration, whether combined with backfilling or not. We analyze 
these results from each log separately. 
Fig. 2. Job sizes and total workload for NASA Ames iPSC/860 ((a) and (c)) and
San Diego Supercomputer Center (SDSC) IBM RS/6000 SP ((b) and (d))
(a) NASA iPSC/860 (b) SDSC RS/6000 SP

Fig. 3. Mean job wait time vs utilization for (a) NASA and (b) SDSC logs 



NASA log: All four schedulers provide similar average job wait time and average
job bounded slowdown for utilizations up to 65%. The FCFS scheduler
saturates at about 77% utilization, whereas the Migration scheduler saturates
at about 80% utilization. Backfilling (with or without migration) allows utilizations
above 80% and saturates closer to 90% (the saturation region for these
schedulers is shown here by plotting values of c > 1.4). We note that migration
provides only a small improvement in wait time and bounded slowdown for most
of the utilization range, and the additional benefits of migration with backfilling
become unpredictable for utilization values close to the saturation region. In the
NASA log, all jobs are of sizes that are powers of two, which results in a good
packing of the torus. Therefore, the benefits of migration are limited.

(a) NASA iPSC/860 (b) SDSC RS/6000 SP
Fig. 4. Mean job bounded slowdown vs utilization for (a) NASA and (b) SDSC
logs

SDSC log: With the SDSC log, the FCFS scheduler saturates at 63%, while 
the stand-alone Migration scheduler saturates at 73%. In this log, with jobs 
of more varied sizes, fragmentation occurs more frequently. Therefore, migration
has a much bigger impact on FCFS, significantly improving the range of
utilizations at which the system can operate. However, we note that when backfilling
is used there is again only a small additional benefit from migration, more
noticeable for utilizations between 75 and 85%. Utilization above 85% can be 
achieved, but only with exponentially growing wait time and bounded slowdown, 
independent of performing migration. 

Figure 5 presents a plot of average job bounded slowdown ($t_j^{bs}$) vs system utilization
($\omega_{util}$) for each of the four schedulers considered and each of the two job
logs. We also include results from the simulation of a fully-connected (flat) machine,
with and without backfilling. (A fully-connected machine does not suffer
from fragmentation.) This allows us to assess the effectiveness of our schedulers 
in overcoming the difficulties imposed by a toroidal interconnect. The overall 
shapes of the curves for wait time are similar to those for bounded slowdown. 

Migration by itself cannot make the results for a toroidal machine as good 
as those for a fully connected machine. For the SDSC log, in particular, a fully 
connected machine saturates at about 80% utilization with just the FCFS scheduler.
For the NASA log, results for backfilling with or without migration in the
toroidal machine are just as good as the backfilling results in the fully connected
machine. For utilizations above 85% in the SDSC log, not even a combination of 
backfilling and migration will perform as well as backfilling on a fully connected 
machine. 

Figure 6 plots the number of migrations performed and the average time 
between migrations vs system utilization for both workloads. We show results for 
the number of total migrations attempted, the number of successful migrations, 
and the maximum possible number of successful migrations (max successful). As 
described in Section 2, the parameters which determine if a migration should be 
attempted are $FN_{tor}$, the ratio of free nodes in the system to the size
of the torus, and $FN_{max}$, the fraction of free nodes contained in the maximal
free partition. According to our standard migration policy, a migration is only
attempted when $FN_{tor} > 0.1$ and $FN_{max} < 0.7$. A successful migration is defined
as a migration attempt that improves the maximal free partition size. The max
successful value is the number of migrations that are successful when a migration
is always attempted (i.e., $FN_{tor} > 0.0$ and $FN_{max} < 1.0$).

Almost all migration attempts were successful for the NASA log. This property
of the NASA log is a reflection of the better packing caused by having jobs
whose sizes are exclusively powers of two. For the SDSC log, we notice that many
more total attempts are made, of which about 80% are successful. If we
always try to migrate every time the state of the torus is modified, no more than
20% of these migrations are successful, and usually far fewer.

For the NASA log, the number of migrations increases linearly while the 
average time between these migrations varies from about 90 to 30 minutes, de- 
pending on the utilization level and its effect on the amount of fragmentation in 
the torus. In contrast to the NASA log, the number of migrations in the SDSC
log does not increase linearly as utilization levels increase. Instead, the relationship
is closer to an elongated bell curve. As utilization levels increase, at first migration
attempts and successes also increase slightly to a fairly steady level. Around
the first signs of saturation the migrations tend to decrease (i.e., at around 70%
utilization for the Migration scheduler and 77% for B+M). Even though the
number of successful migrations is greater for the SDSC log, the average time 
between migrations is still longer as a result of the larger average job execution 
time. 

Most of the benefit of migration is achieved when we only perform migration 
according to our parameters. Applying these parameters has three main advan- 
tages: we reduce the frequency of migration attempts so as not to always suffer 
the required overhead of migration, we increase the percentage of migration at- 
tempts that are successful, and additionally we increase the average benefits of 
a successful migration. This third advantage is apparent when we compare the 
mean job wait time results for our standard $FN_{tor}$ and $FN_{max}$ settings to those
of the scheduler that always attempts to migrate. Even though the maximum
possible number of successful migrations is sometimes twice as many as our actual
number of successes, Figure 7 reveals that the additional benefit of these
successful migrations is very small.




(a) NASA iPSC/860 (b) SDSC RS/6000 SP
(c) NASA iPSC/860 (d) SDSC RS/6000 SP
Fig. 6. Number of total, successful, and maximum possible successful migrations
vs utilization ((a) and (b)), and average time between migrations vs utilization
((c) and (d))

(a) NASA iPSC/860 (b) SDSC RS/6000 SP
Fig. 7. Mean job wait time vs utilization for the NASA and SDSC logs, comparing
the standard migration policy to a full migration policy that always attempts
to migrate



We complete this section with an analysis of results for system capacity uti- 
lized, unused capacity, and lost capacity. The results for each scheduler type 
and both standard job logs (c = 1.0) are plotted in Figure 8. The utilization 
improvements for the NASA log are barely noticeable - again, because its jobs 
fill the torus more compactly. The SDSC log, however, shows the greatest im- 
provement when using B+M over FCFS, with a 15% increase in capacity utilized 
and a 54% decrease in the amount of capacity lost. By themselves, the Backfill
and Migration schedulers increase capacity utilization by 15% and 13%,
respectively, while decreasing capacity loss by 44% and 32%, respectively. These
results show that B+M is significantly more effective at transforming lost capac- 
ity into unused capacity. Under the right circumstances, it should be possible to 
utilize this unused capacity more effectively. 

4 Related and Future Work 

The topics of our work have been the subject of extensive previous research. In 
particular, [8, 14, 17] have shown that backfilling on a flat machine like the IBM
RS/6000 SP is an effective means of improving quality of service. The benefits
of combining migration and gang-scheduling have been demonstrated both for
flat machines [24, 25] and toroidal machines like the Cray T3D [7]. The results
in [7] are particularly remarkable, as system utilization was improved from 33%, 
with a pure space-sharing approach, to 96% with a combination of migration 
and gang-scheduling. The work in [21] discusses techniques to optimize spatial 
allocation of jobs in mesh-connected multicomputers, including changing the job 
size, and how to combine spatial- and time-sharing scheduling algorithms. An efficient
job scheduling technique for a three-dimensional torus is described in [2].
This paper, therefore, builds on this previous research by applying a combination
of backfilling and migration algorithms, exclusively through space-sharing
techniques, to improve system performance on a toroidal-interconnected system.

(a) NASA iPSC/860 (b) SDSC RS/6000 SP
Fig. 8. Capacity utilized, lost, and unused as a fraction of the total system
capacity

Future work opportunities can further build on the results of this paper. The 
impact of different FCFS scheduling heuristics for a torus, besides the largest free 
partition heuristic currently used, can be studied. It is also important to iden- 
tify how the current heuristic relates to the optimal solution in different cases. 
Additional study of the parameters I, $FN_{tor}$, and $FN_{max}$ may determine further
tradeoffs associated with partition size increases and more or less frequent
migration attempts. Finally, while we do not attempt to implement complex 
time-sharing schedulers such as those used in gang-scheduling, a more limited 
time-sharing feature may be beneficial. Preemption, for example, allows for the 
suspension of a job until it is resumed at a later time. These time-sharing tech- 
niques may provide the means to further enhance the B+M scheduler and make 
the system performance of a toroidal-interconnected machine more similar to
that of a flat machine. 

5 Conclusions 

We have investigated the behavior of various scheduling algorithms to deter- 
mine their ability to increase processor utilization and decrease job wait time 
in the BG/L system. We have shown that a scheduler which uses only a back- 
filling algorithm performs better than a scheduler which uses only a migration 
algorithm, and that migration is particularly effective under a workload that 
produces a large amount of fragmentation (i.e., when many small to mid-sized 
jobs of varied sizes represent much of the workload). Migration has a significant 
implementation overhead but it does not require any additional information be- 
sides what is required by the FCFS scheduler. Backfilling, on the other hand, 
does not have a significant implementation overhead but requires additional in- 
formation pertaining to the execution time of jobs. 
Simulations of FCFS, backfilling, and migration space-sharing scheduling algorithms
have shown that B+M, a scheduler which implements all of these algorithms,
shows a small performance improvement over just FCFS and backfilling.
However, B+M does convert significantly more lost capacity into unused capacity
than just backfilling. Additional enhancements to the B+M scheduler may harness
this unused capacity to provide further system improvements. Even with the
performance enhancements of backfilling and migration techniques, a toroidal- 
interconnected machine such as BG/L can only approximate the job scheduling 
efficiency of a fully connected machine in which all nodes are equidistant. 

A Projection of Partitions (POP) Algorithm

In a given three-dimensional torus of shape M x M x M where some nodes have
been allocated for jobs, the POP algorithm provides an $O(M^5)$ time algorithm
for determining the size of the largest free rectangular partition. This algorithm
is a substantial improvement over an exhaustive search algorithm that takes
$O(M^9)$ time.

Let FREEPART = {(B, S) | B is a base location (i, j, k) and S is a partition
size (a, b, c) such that for all x, y, z with $i \le x < (i + a)$, $j \le y < (j + b)$, $k \le z < (k + c)$,
node (x mod M, y mod M, z mod M) is free}. POP narrows the scope of
the problem by determining the largest rectangular partition $P \in$ FREEPART
rooted at each of the $M^3$ possible base locations and then deriving a global
maximum. Given a base location, POP works by finding the largest partition 
first in one dimension, then by projecting adjacent one-dimensional columns 
onto each other to find the largest partition in two dimensions, and iteratively 
projecting adjacent two-dimensional planes onto each other to find the largest 
partition in three dimensions. 

First, a partition table of the largest one-dimensional partitions $P \in$ FREEPART
is pre-computed for all three dimensions and at every possible base
location in $O(M^4)$ time. This is done by iterating through each partition and
whenever an allocated node is reached, all entries for the current "row" may be 
filled in from a counter value, where the counter is incremented for each adjacent 
free node and reset to zero whenever an additional allocated node is reached. 
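One row of the partition table can be sketched as follows. This is an equivalent formulation we chose for brevity (two backward passes around the ring rather than the fill-on-allocated-node bookkeeping described above), and the function name is hypothetical. For each position it records how many consecutive free nodes start there, with wraparound.

```python
def run_lengths(free):
    """free: list of booleans around a ring of M nodes.
    Returns, for each index i, the number of consecutive free nodes starting
    at i and moving forward with wraparound (capped at M)."""
    m = len(free)
    if all(free):
        return [m] * m
    table = [0] * m
    count = 0
    # Two backward passes: the first seeds the counter across the wraparound
    # point, the second writes the final values.
    for idx in range(2 * m - 1, -1, -1):
        i = idx % m
        count = count + 1 if free[i] else 0
        table[i] = count
    return table

# Ring of 4 nodes with node 2 allocated: starting at node 3 we can wrap
# through nodes 0 and 1 before hitting node 2.
print(run_lengths([True, True, False, True]))  # [2, 1, 0, 3]
```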

For a given base location (i, j, k), we fix one dimension (e.g., k), start a
counter X = i in the next dimension, and multiply X by the minimum partition
table entry of the third dimension for (x mod M, j, k), where x varies as $i \le x \le X$
and X varies as $i \le X < (i + M)$. As the example in Figure 9 shows, when X
= 1 for some fixed k at base location (1, 2, k) the partition table entry in the Y
dimension will equal 3 since there are 3 consecutive free nodes, and our largest
possible partition size is initially set to 3. When X increases to 2, the minimum
table entry becomes 2 because of the allocated node at location (2, 4, k) and the
largest possible partition size is increased to 4. When X = 3, we calculate a new
largest possible partition size of 6. Finally, when we come across a partition table
entry in the Y dimension of 0 because of the allocated node at location (4, 2, k),
we stop increasing X. We would also have to repeat a similar calculation along
the Y dimension, by starting a counter Y.
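The two-dimensional projection step can be sketched as a scan over increasing widths that tracks the minimum Y-direction table entry seen so far. The table layout (`table_y[x][y]`, zero-based indices) and the function name are assumptions made for illustration. The sample values mirror the worked example above: 3, then 2, then an entry of at least 2 (we use 2), then an allocated node.

```python
def largest_2d_at_base(table_y, i, j):
    """table_y[x][y]: number of consecutive free nodes along Y starting at (x, y).
    Returns the largest free rectangle rooted at base (i, j), found by
    projecting adjacent columns onto each other as the width grows."""
    m = len(table_y)               # extent of the X dimension
    best, min_run = 0, None
    for width in range(1, m + 1):
        x = (i + width - 1) % m    # wrap around the torus
        run = table_y[x][j]
        min_run = run if min_run is None else min(min_run, run)
        if min_run == 0:           # allocated node reached, stop widening
            break
        best = max(best, width * min_run)
    return best

# Y-direction run lengths for four adjacent columns at a single row.
table_y = [[3], [2], [2], [0]]
print(largest_2d_at_base(table_y, 0, 0))  # 6: candidate sizes are 3, then 4, then 6
```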




Fig. 9. 2-dimensional POP Algorithm applied to Base Location (1,2): Adjacent
1-dimensional columns are projected onto each other as X is incremented








Elie Krevat et al. 



Finally, this same idea is extended to work for 3 dimensions. Given a similar
base location (i, j, k), we start a counter Z in the Z dimension and calculate
the maximum two-dimensional partition given the current value of Z. Then we
project the adjacent two-dimensional planes by incrementing Z and calculating
the largest two-dimensional partition while using the minimum partition table
entry of the X and Y dimensions for (i, j, z mod M), where z varies as $k \le z \le Z$.

Using the initial partition table, it takes O(M) time to calculate a projection 
for two adjacent planes and to determine the largest two-dimensional partition. 
Since there are $O(M)$ projections required for $O(M^3)$ base locations, our final
algorithm runs in $O(M^5)$ time.
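As a rough accounting of these bounds, the factors can be multiplied out as below. The POP factors follow the description above. The three-factor breakdown of the exhaustive search (base locations, shapes, nodes checked per shape) is our assumption about how the ninth-power figure arises and is not stated explicitly in the text.

```latex
\[
T_{\mathrm{POP}}
  = \underbrace{O(M^4)}_{\text{partition table}}
  + \underbrace{O(M^3)}_{\text{base locations}}
    \cdot \underbrace{O(M)}_{\text{projections}}
    \cdot \underbrace{O(M)}_{\text{work per projection}}
  = O(M^5)
\]
\[
T_{\mathrm{exhaustive}}
  = \underbrace{O(M^3)}_{\text{base locations}}
    \cdot \underbrace{O(M^3)}_{\text{shapes } (a, b, c)}
    \cdot \underbrace{O(M^3)}_{\text{nodes checked}}
  = O(M^9)
\]
```

Under this accounting the gap between the two approaches grows as $M^4$.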

When we implemented this algorithm in our scheduling simulator, we 
achieved a significant speed improvement. For the original NASA log, scheduling 
time improved from an average of 0.51 seconds for every successfully scheduled 
job to 0.16 seconds, while the SDSC log improved from an average of 0.125 
seconds to 0.063 seconds. The longest time to successfully schedule a job also 
improved from 38 seconds to 8.3 seconds in the NASA log, and from 50 seconds 
to 8.5 seconds in the SDSC log. 






